APRIORI APPROACH TO GRAPH-BASED CLUSTERING OF TEXT DOCUMENTS by Mahmud

نویسندگان

Shahriar Hossain

Mahmud Shahriar Hossain

Rafal A. Angryk

John Paxton

Carl A. Fox

چکیده

This thesis report introduces a new technique of document clustering based on frequent senses. The developed system, named GDClust (Graph-Based Document Clustering) [1], works with frequent senses rather than dealing with frequent keywords used in traditional text mining techniques. GDClust presents text documents as hierarchical document-graphs and uses an Apriori paradigm to find the frequent subgraphs, which reflect frequent senses. Discovered frequent subgraphs are then utilized to generate accurate sense-based document clusters. We propose a novel multilevel Gaussian minimum support strategy for candidate subgraph generation. Additionally, we introduce another novel mechanism called Subgraph-Extension mining that reduces the number of candidates and overhead imposed by the traditional Apriori-based candidate generation mechanism. GDClust utilizes an English language thesaurus (WordNet [2]) to construct document-graphs and exploits graph-based data mining techniques for sense discovery and clustering. It is an automated system and requires minimal human interaction for the clustering purpose.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

خوشه‌بندی اسناد مبتنی بر آنتولوژی و رویکرد فازی

Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...

متن کامل

Plagiarism Detection Considering Frequent Senses Using Graph Based Research Document Clustering

A new, graph based research document clustering technique (GRD-Clust) is introduced based on frequent senses rather than frequent keywords as per the traditional document clustering techniques.GRDClust presents text documents as hierarchal document-graphs and utilizes an Apriori paradigm to find the frequent sub graphs, which reflect frequent senses based on support and confidence. We highlight...

متن کامل

Clustering Web Documents based on Efficient Multi-Tire Hashing Algorithm for Mining Frequent Termsets

Document Clustering is one of the main themes in text mining. It refers to the process of grouping documents with similar contents or topics into clusters to improve both availability and reliability of text mining applications. Some of the recent algorithms address the problem of high dimensionality of the text by using frequent termsets for clustering. Although the drawbacks of the Apriori al...

متن کامل

Performance Evaluation of an Efficient Frequent Item sets-Based Text Clustering Approach

The vast amount of textual information available in electronic form is growing at a staggering rate in recent times. The task of mining useful or interesting frequent itemsets (words/terms) from very large text databases that are formed as a result of the increasing number of textual data still seems to be a quite challenging task. A great deal of attention in research community has been receiv...

متن کامل

An Optimal Approach to Local and Global Text Coherence Evaluation Combining Entity-based, Graph-based and Entropy-based Approaches

Text coherence evaluation becomes a vital and lovely task in Natural Language Processing subfields, such as text summarization, question answering, text generation and machine translation. Existing methods like entity-based and graph-based models are engaging with nouns and noun phrases change role in sequential sentences within short part of a text. They even have limitations in global coheren...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2008

APRIORI APPROACH TO GRAPH-BASED CLUSTERING OF TEXT DOCUMENTS by Mahmud

نویسندگان

چکیده

منابع مشابه

خوشه‌بندی اسناد مبتنی بر آنتولوژی و رویکرد فازی

Plagiarism Detection Considering Frequent Senses Using Graph Based Research Document Clustering

Clustering Web Documents based on Efficient Multi-Tire Hashing Algorithm for Mining Frequent Termsets

Performance Evaluation of an Efficient Frequent Item sets-Based Text Clustering Approach

An Optimal Approach to Local and Global Text Coherence Evaluation Combining Entity-based, Graph-based and Entropy-based Approaches

عنوان ژورنال:

اشتراک گذاری